Fame of Programming Languages

Get the Data

Either use the provided .csv file or, optionally, get the freshest data by running an SQL query on StackExchange:

Follow this link to run the query from StackExchange to get your own .csv file

select dateadd(month, datediff(month, 0, q.CreationDate), 0) m, TagName, count(*)
from PostTags pt
join Posts q on q.Id=pt.PostId
join Tags t on t.Id=pt.TagId
where TagName in ('java','c','c++','python','c#','javascript','assembly','php','perl','ruby','visual basic','swift','r','object-c','scratch','go','delphi')
and q.CreationDate < dateadd(month, datediff(month, 0, getdate()), 0)
group by dateadd(month, datediff(month, 0, q.CreationDate), 0), TagName
order by dateadd(month, datediff(month, 0, q.CreationDate), 0)

Import Statements
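
A minimal set of imports that would likely cover the steps in this notebook, assuming pandas is used for the data and Matplotlib for the charts:

```python
import pandas as pd               # dataframes and time-series helpers
import matplotlib.pyplot as plt   # plotting
```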

Data Exploration

Challenge: Read the .csv file and store it in a Pandas dataframe
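
One possible way to approach this is sketched below. The column names DATE, TAG and POSTS are assumptions chosen for readability (the query itself leaves the count column unnamed), and the in-memory text stands in for the real file so the snippet runs on its own. The numbers are made up.

```python
import io
import pandas as pd

# Stand-in for the .csv file: one row per (month, tag) pair with a post count.
# The values here are invented for illustration only.
sample_csv = io.StringIO(
    "m,TagName,\n"
    "2008-07-01 00:00:00,c#,3\n"
    "2008-08-01 00:00:00,assembly,8\n"
    "2008-08-01 00:00:00,javascript,162\n"
    "2008-08-01 00:00:00,c,85\n"
)

# header=0 discards the original header row; names= supplies our own labels.
df = pd.read_csv(sample_csv, names=["DATE", "TAG", "POSTS"], header=0)
print(df)
```

With the real data, the path to the downloaded .csv file would be passed to pd.read_csv() in place of sample_csv.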

Challenge: Examine the first 5 rows and the last 5 rows of the dataframe
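
A short sketch of how this could be done with head() and tail(), using a small made-up dataframe with the assumed column names DATE, TAG and POSTS:

```python
import pandas as pd

# Invented sample rows; the real dataframe would come from the .csv file.
df = pd.DataFrame({
    "DATE": ["2008-07-01", "2008-08-01", "2008-08-01", "2008-08-01",
             "2008-09-01", "2008-09-01", "2008-09-01"],
    "TAG": ["c#", "c", "java", "python", "c", "java", "python"],
    "POSTS": [3, 85, 222, 124, 320, 1137, 542],
})

first_rows = df.head()  # first 5 rows by default
last_rows = df.tail()   # last 5 rows by default
print(first_rows)
print(last_rows)
```

Both methods accept a number, for example df.head(10), if more or fewer rows are wanted.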

Challenge: Check how many rows and how many columns there are. What are the dimensions of the dataframe?
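
The shape attribute is one way to answer this. The sketch below uses invented data and the assumed column names DATE, TAG and POSTS:

```python
import pandas as pd

# Invented sample rows for illustration.
df = pd.DataFrame({
    "DATE": ["2008-07-01", "2008-08-01", "2008-08-01", "2008-08-01"],
    "TAG": ["c#", "c", "java", "python"],
    "POSTS": [3, 85, 222, 124],
})

# shape is a (rows, columns) tuple.
rows, cols = df.shape
print(f"{rows} rows and {cols} columns")
```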

Challenge: Count the number of entries in each column of the dataframe
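
A possible approach is count(), which reports the number of non-missing values in each column. The data below is invented, and one value is deliberately left missing to show the effect:

```python
import pandas as pd

# Invented sample rows; one POSTS value is missing on purpose.
df = pd.DataFrame({
    "DATE": ["2008-07-01", "2008-08-01", "2008-08-01", "2008-08-01"],
    "TAG": ["c#", "c", "java", "python"],
    "POSTS": [3, 85, None, 124],
})

# count() skips missing (NaN) values, so columns can report different totals.
entries = df.count()
print(entries)
```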

Challenge: Calculate the total number of posts per language. Which programming language has had the highest total number of posts of all time?
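
One way to sketch this is to group by the tag and sum the post counts. The column names TAG and POSTS are assumptions, and the numbers are invented, so the result here says nothing about the real ranking:

```python
import pandas as pd

# Invented sample rows for illustration.
df = pd.DataFrame({
    "TAG": ["java", "python", "c", "java", "python", "c"],
    "POSTS": [100, 80, 60, 150, 200, 50],
})

# Sum the posts within each language, then sort from largest to smallest.
totals = df.groupby("TAG")["POSTS"].sum().sort_values(ascending=False)
top_language = totals.idxmax()  # label of the largest total
print(totals)
print(top_language)
```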

Some languages are older (e.g., C) and other languages are newer (e.g., Swift). The dataset starts in September 2008.

Challenge: How many months of data exist per language? Which language had the fewest months with an entry?
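
A possible sketch, assuming each row is one month of data for one language and the columns are named DATE and TAG. The rows are invented:

```python
import pandas as pd

# Invented sample rows: "c" appears in three months, "swift" in only one.
df = pd.DataFrame({
    "DATE": ["2014-04-01", "2014-05-01", "2014-06-01", "2014-06-01"],
    "TAG": ["c", "c", "c", "swift"],
    "POSTS": [900, 950, 1000, 40],
})

# Count the rows (months) recorded for each language.
months_per_language = df.groupby("TAG")["DATE"].count()
fewest = months_per_language.idxmin()  # language with the fewest months
print(months_per_language)
print(fewest)
```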

Data Cleaning

Let's fix the date format to make it more readable. We need to use Pandas to convert each date from a string such as "2008-07-01 00:00:00" to a datetime object, which displays as "2008-07-01".
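
The conversion could be sketched with pd.to_datetime(), assuming the date column is named DATE (a name chosen here for illustration):

```python
import pandas as pd

# Invented sample rows with dates stored as strings.
df = pd.DataFrame({
    "DATE": ["2008-07-01 00:00:00", "2008-08-01 00:00:00"],
    "TAG": ["c#", "c"],
    "POSTS": [3, 85],
})

print(type(df["DATE"].iloc[0]))  # a plain str before the conversion

# Replace the string column with proper datetime values.
df["DATE"] = pd.to_datetime(df["DATE"])
print(df["DATE"])
```

Once the column holds datetime values, Pandas and Matplotlib can treat it as a real time axis rather than as text.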

Data Manipulation

Challenge: What are the dimensions of our new dataframe? How many rows and columns does it have? Print out the column names and print out the first 5 rows of the dataframe.
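
The text does not say how the new dataframe is produced. The sketch below assumes it is a pivot of the original data, with one row per date and one column per language. The names reshaped_df, DATE, TAG and POSTS are hypothetical and the values are invented:

```python
import pandas as pd

# Invented long-format rows: one row per (date, language) pair.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2008-07-01", "2008-08-01", "2008-08-01",
                            "2008-09-01", "2008-09-01", "2008-09-01"]),
    "TAG": ["c#", "c#", "java", "c#", "java", "python"],
    "POSTS": [3, 500, 200, 1600, 1100, 500],
})

# Assumed reshaping step: dates become the index, languages become columns.
reshaped_df = df.pivot(index="DATE", columns="TAG", values="POSTS")

print(reshaped_df.shape)     # (rows, columns)
print(reshaped_df.columns)   # column names, one per language
print(reshaped_df.head())    # first 5 rows
```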

Challenge: Count the number of entries per programming language. Why might the number of entries be different?
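
A sketch of one way to look at this, assuming a wide dataframe with one column per language (the name reshaped_df and the values are hypothetical). A language with no posts in a given month has a missing value there, and count() skips missing values:

```python
import pandas as pd

# Invented wide-format data: NaN marks months with no entry for a language.
reshaped_df = pd.DataFrame(
    {
        "c#": [3.0, 500.0, 1600.0],
        "java": [float("nan"), 200.0, 1100.0],
        "python": [float("nan"), float("nan"), 500.0],
    },
    index=pd.to_datetime(["2008-07-01", "2008-08-01", "2008-09-01"]),
)

# Non-missing entries per language; they differ where NaN values appear.
counts = reshaped_df.count()
print(counts)

# Optional: treat "no entry" as zero posts so every column has the same length.
filled_df = reshaped_df.fillna(0)
print(filled_df.count())
```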

Data Visualisation with Matplotlib

Challenge: Use the matplotlib documentation to plot a single programming language (e.g., java) on a chart.
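
A minimal sketch of a single-language chart, assuming a dataframe indexed by date with one column per language. The name reshaped_df and the values are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented monthly post counts for illustration.
reshaped_df = pd.DataFrame(
    {"java": [200, 1100, 1150, 950]},
    index=pd.to_datetime(["2008-08-01", "2008-09-01", "2008-10-01", "2008-11-01"]),
)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(reshaped_df.index, reshaped_df["java"])  # x: dates, y: post counts
ax.set_xlabel("Date")
ax.set_ylabel("Number of posts")
plt.show()
```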

Challenge: Show two lines (e.g., for Java and Python) on the same chart.
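
Calling plot() twice on the same axes is one way to draw both series together. The sketch assumes a dataframe indexed by date with one column per language. The name reshaped_df and the values are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented monthly post counts for two languages.
reshaped_df = pd.DataFrame(
    {"java": [200, 1100, 1150, 950], "python": [120, 540, 510, 450]},
    index=pd.to_datetime(["2008-08-01", "2008-09-01", "2008-10-01", "2008-11-01"]),
)

fig, ax = plt.subplots(figsize=(8, 5))
# Each call adds one line to the same axes; label= feeds the legend.
ax.plot(reshaped_df.index, reshaped_df["java"], label="java")
ax.plot(reshaped_df.index, reshaped_df["python"], label="python")
ax.set_xlabel("Date")
ax.set_ylabel("Number of posts")
ax.legend()
plt.show()
```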

Smoothing out Time Series Data

Time series data can be quite noisy, with a lot of up and down spikes. To see a trend more clearly, we can plot an average of, say, 6 or 12 observations. This is called the rolling mean. We calculate the average over a window of time and then move the window forward by one observation. Pandas has two handy methods already built in to work this out: rolling() and mean().
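
The windowed average described above can be sketched as follows. The data is invented, and a window of 3 is used only because the sample is short. With the real data a window of 6 or 12 would be the natural choice:

```python
import pandas as pd

# Invented monthly values for one language.
posts = pd.DataFrame(
    {"java": [10, 20, 30, 40, 50, 60]},
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
)

# rolling() defines the moving window; mean() averages the values inside it.
rolling_df = posts.rolling(window=3).mean()
print(rolling_df)
```

The first window - 1 rows are NaN because a full window of observations is not yet available at those points.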